Table of Contents¶

0. Context

1. Import the Dataset and Explore the Data
Check data contents, provide descriptive statistics, and check for incoherencies in the data.
Explore data visually and extract relevant insights.
Explain your rationale and findings.
Do not forget to analyze multivariate relationships.
    1.1 Importing Libraries
    1.2 Loading and Reading the Dataset
    1.3 Descriptive Statistics
    1.4 Incoherencies
    1.5 Exploring Data Visually
    1.6 Multivariate Relationships

2. Clean and Pre-process the Data
Are there any missing values? Take action to handle them.
Check the dataset for outliers and pre-process them. Justify your decisions.
Deal with categorical variables.
Review current features and create extra features if needed. Explain your steps.
Perform data scaling. Explain the reasoning behind your choices.
    2.1 Missing Values
    2.2 Duplicates
    2.3 Outliers
    2.4 Categorical Data
    2.5 Aggregations
    2.6 Feature Engineering
    2.7 Data Scaling

3. Feature Selection
Define and implement an unambiguous strategy for feature selection.
Use methods discussed in the course.
Present and justify your final selection.
    3.1 Filter Methods
        3.1.1 Univariate Variables
        3.1.2 Correlation Indices
        3.1.3 Chi-Square
    3.2 Wrapper Methods
        3.2.1 RFE
    3.3 Embedded Methods
        3.3.1 Lasso
    3.4 Final Insights

4. Build a Simple Model and Assess the Performance
Identify the type of the problem and select the relevant algorithms.
Use cross-validation to assess performance. Which metrics did you use and why?
Train at least 1 model using the train dataet with a macro F1-score of more than 0.6.
et with a macro F1-score of more than 0.6.
et with a macro F1-score of more than 0.6.
a macro F1-score of more than 0.6.

ataset with a macro F1-score of more than 0.6. ataset with a macro F1-score of more than 0.6. ataset with a macro F1-score of more than 0.6. dataset with a macro F1-score of more than 0.6. dataset with a macro F1-score of more than 0.6. taset with a macro F1-score of more than 0.6. [2 ith a macro F1-score of more than 0.6.

0. Context ¶

The New York Workers’ Compensation Board (train_data) administers and regulates workers’ compensation, disability, and other workers’ benefits.
train_data is responsible for assembling and deciding on claims whenever it becomes aware of a workplace injury. Since 2000, the train_data has assembled and reviewed more than 5 million claims. However, manually reviewing all claims is an arduous and time-consuming process. For that reason, the train_data has reached out to Nova IMS to assist them in the creation of a model that can automate the decision-making whenever a new claim is received.

Our task is to create a classification model that can accurately predict the train_data’s final decision on what type o injury (Claim Injury Type) should be given to a caim.l To do that, the train_data has provided labelled data with all claims assembled betwee 2020 and 22.02 <b.

Import the Dataset and Explore the Data ¶

1.1 Importing Libraries ¶

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil
import warnings
%config InlineBackend.figure_format = 'retina' # optionally, you can change 'svg' to 'retina'
sns.set()
warnings.filterwarnings('ignore')

1.2 Loading and Reading the Dataset ¶

In [3]:
train_data = pd.read_csv('train_data.csv', sep = ',', low_memory=False)   #sep is good to seperate data
pd.set_option('display.max_columns', None) #to be able too see all columns
train_data.head(5)
Out[3]:
Accident Date Age at Injury Alternative Dispute Resolution Assembly Date Attorney/Representative Average Weekly Wage Birth Year C-2 Date C-3 Date Carrier Name Carrier Type Claim Identifier Claim Injury Type County of Injury COVID-19 Indicator District Name First Hearing Date Gender IME-4 Count Industry Code Industry Code Description Medical Fee Region OIICS Nature of Injury Description WCIO Cause of Injury Code WCIO Cause of Injury Description WCIO Nature of Injury Code WCIO Nature of Injury Description WCIO Part Of Body Code WCIO Part Of Body Description Zip Code Agreement Reached WCB Decision Number of Dependents
0 2019-12-30 31.0 N 2020-01-01 N 0.00 1988.0 2019-12-31 NaN NEW HAMPSHIRE INSURANCE CO 1A. PRIVATE 5393875 2. NON-COMP ST. LAWRENCE N SYRACUSE NaN M NaN 44.0 RETAIL TRADE I NaN 27.0 FROM LIQUID OR GREASE SPILLS 10.0 CONTUSION 62.0 BUTTOCKS 13662 0.0 Not Work Related 1.0
1 2019-08-30 46.0 N 2020-01-01 Y 1745.93 1973.0 2020-01-01 2020-01-14 ZURICH AMERICAN INSURANCE CO 1A. PRIVATE 5393091 4. TEMPORARY WYOMING N ROCHESTER 2020-02-21 F 4.0 23.0 CONSTRUCTION I NaN 97.0 REPETITIVE MOTION 49.0 SPRAIN OR TEAR 38.0 SHOULDER(S) 14569 1.0 Not Work Related 4.0
2 2019-12-06 40.0 N 2020-01-01 N 1434.80 1979.0 2020-01-01 NaN INDEMNITY INSURANCE CO OF 1A. PRIVATE 5393889 4. TEMPORARY ORANGE N ALBANY NaN M NaN 56.0 ADMINISTRATIVE AND SUPPORT AND WASTE MANAGEMEN... II NaN 79.0 OBJECT BEING LIFTED OR HANDLED 7.0 CONCUSSION 10.0 MULTIPLE HEAD INJURY 12589 0.0 Not Work Related 6.0
3 NaN NaN NaN 2020-01-01 NaN NaN NaN NaN NaN NaN NaN 957648180 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2019-12-30 61.0 N 2020-01-01 N NaN 1958.0 2019-12-31 NaN STATE INSURANCE FUND 2A. SIF 5393887 2. NON-COMP DUTCHESS N ALBANY NaN M NaN 62.0 HEALTH CARE AND SOCIAL ASSISTANCE II NaN 16.0 HAND TOOL, UTENSIL; NOT POWERED 43.0 PUNCTURE 36.0 FINGER(S) 12603 0.0 Not Work Related 1.0

Metadata¶

Claim Dates
Accident Date Injury date of the claim.
Assembly Date The date the claim was first assembled.
C-2 Date Date of receipt of the Employer's Report of Work-Related Injury/Illness or equivalent (formerly Form C-2).
C-3 Date Date Form C-3 (Employee Claim Form) was received.
First Hearing Date Date the first hearing was held on a claim at a train_data hearing location. A blank date means the claim has not yet had a hearing held.




Worker Demographics
Age at Injury Age of injured worker when the injury occurred.
Birth Year The reported year of birth of the injured worker.
Gender The reported gender of the injured worker.
Zip Code The reported ZIP code of the injured worker’s home address.



Claim and Case Information
Alternative Dispute Resolution Adjudication processes external to the Board.
Attorney/Representative Is the claim being represented by an Attorney?
Claim Identifier Unique identifier for each claim, assigned by train_data.
Carrier Name Name of primary insurance provider responsible for providing workers’ compensation coverage to the injured worker’s employer.
Carrier Type Type of primary insurance provider responsible for providing workers’ compensation coverage.
Average Weekly Wage The wage used to calculate workers’ compensation, disability, or paid leave wage replacement benefits.





Location and Region
County of Injury Name of the New York County where the injury occurred.
District Name Name of the train_data district office that oversees claims for that region or area of the state.
Medical Fee Region Approximate region where the injured worker would receive medical service.


Incident and Injury Details
COVID-19 Indicator Indication that the claim may be associated with COVID-19.
IME-4 Count Number of IME-4 forms received per claim. The IME-4 form is the “Independent Examiner's Report of Independent Medical Examination” form.

Industry Classification
Industry Code NAICS code and descriptions are available at https://www.naics.com/search-naics-codes-by-industry/.
Industry Code Description 2-digit NAICS industry code description used to classify businesses according to their economic activity.

Injury Descriptions and Codes
OIICS Nature of Injury Description The OIICS nature of injury codes & descriptions are available at https://www.bls.gov/iif/oiics_manual_2007.pdf.
WCIO Cause of Injury Code The WCIO cause of injury codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Cause_Table.pdf.
WCIO Cause of Injury Description See description of field above.
WCIO Nature of Injury Code The WCIO nature of injury codes are available at https://www.wcio.org/Active%20PNC/WCIO_Nature_Table.pdf.
WCIO Nature of Injury Description See description of field above.
WCIO Part Of Body Code The WCIO part of body codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Part_Table.pdf.
WCIO Part Of Body Description See description of field above.






Claim Outcomes
Agreement Reached Binary variable: Yes if there is an agreement without the involBement of the train_data; otherwise unknown at the start of a claim.
train_data Decision Multiclass variable: Decision of the train_data relative to the claim; "Accident" indicates a workplace accident, and "Occupational Disease" indicates illness from the workplace, both of which require train_data deliberation and may be unknown at the claim's start.
Claim Injury Type Main target variable: Deliberation of the train_data relative to benefits awarded to the claim, with numbering indicating severity.
aim, with numbering indicating severity.




1.3 Descriptive Statistics ¶

Shape¶

In [4]:
train_data.shape
Out[4]:
(593471, 33)

Change in Datatypes:¶

float -> integer
Age at Injury
Birth Year
IME-4 Count
Number of Dependents
Agreement Reached

float -> object
OIICS Nature of Injury Description

Object -> Dates
C-2 Date
C-3 Date
First Hearing Date
Accident Date
Assembly Dates

Change in datatypes¶

Data conversion to integer:

In [5]:
train_data_to_int = ['Age at Injury', 'Birth Year', 'IME-4 Count', 'Number of Dependents', 'Agreement Reached']

for col in train_data_to_int:
    # Convert to numeric and handle NaNs by keeping them as NaN, then cast to Int64
    train_data[col] = pd.to_numeric(train_data[col], errors='coerce').astype("Int64")

Data conversion to object:

In [6]:
train_data_to_obj = ['OIICS Nature of Injury Description']

for col in train_data_to_obj:
    train_data[col] = train_data[col].astype('object')
In [7]:
train_data_float_to_int = ['WCIO Cause of Injury Code','WCIO Nature of Injury Code','WCIO Part Of Body Code']

for col in train_data_float_to_int:
    # Convert the column to numeric, then to Int64 (nullable integer type)
    train_data[col] = pd.to_numeric(train_data[col], errors='coerce').astype('Int64')
In [8]:
train_data_to_obj = ['OIICS Nature of Injury Description']

for col in train_data_to_obj:
    train_data[col] = train_data[col].astype('object')

Categorical Data from number to object/category:

In [9]:
train_data_num_to_object= ['Industry Code', 'Agreement Reached', 'WCIO Cause of Injury Code', 
                    'WCIO Nature of Injury Code', 'WCIO Part Of Body Code','Claim Identifier']

for col in train_data_num_to_object:
# Convert the column to an object while keeping NaNs
    train_data[col] = train_data[col].astype('object')

Dates from object to datetime format:

In [10]:
# Convert dates treated as objects to datetime format
train_data['C-2 Date'] = pd.to_datetime(train_data['C-2 Date'], errors='coerce')
train_data['C-3 Date'] = pd.to_datetime(train_data['C-3 Date'], errors='coerce')
train_data['Accident Date'] = pd.to_datetime(train_data['Accident Date'], errors='coerce')
train_data['First Hearing Date'] = pd.to_datetime(train_data['First Hearing Date'], errors='coerce')
train_data['Assembly Date'] = pd.to_datetime(train_data['Assembly Date'], errors='coerce')
In [11]:
train_data.dtypes
Out[11]:
Accident Date                         datetime64[ns]
Age at Injury                                  Int64
Alternative Dispute Resolution                object
Assembly Date                         datetime64[ns]
Attorney/Representative                       object
Average Weekly Wage                          float64
Birth Year                                     Int64
C-2 Date                              datetime64[ns]
C-3 Date                              datetime64[ns]
Carrier Name                                  object
Carrier Type                                  object
Claim Identifier                              object
Claim Injury Type                             object
County of Injury                              object
COVID-19 Indicator                            object
District Name                                 object
First Hearing Date                    datetime64[ns]
Gender                                        object
IME-4 Count                                    Int64
Industry Code                                 object
Industry Code Description                     object
Medical Fee Region                            object
OIICS Nature of Injury Description            object
WCIO Cause of Injury Code                     object
WCIO Cause of Injury Description              object
WCIO Nature of Injury Code                    object
WCIO Nature of Injury Description             object
WCIO Part Of Body Code                        object
WCIO Part Of Body Description                 object
Zip Code                                      object
Agreement Reached                             object
WCB Decision                                  object
Number of Dependents                           Int64
dtype: object
In [12]:
train_data.dtypes.astype(str).value_counts().sort_values(ascending=False)
Out[12]:
object            23
datetime64[ns]     5
Int64              4
float64            1
Name: count, dtype: int64

Column with unexpected negative values¶

In [13]:
train_data[train_data['WCIO Part Of Body Code'] < 0]
Out[13]:
Accident Date Age at Injury Alternative Dispute Resolution Assembly Date Attorney/Representative Average Weekly Wage Birth Year C-2 Date C-3 Date Carrier Name Carrier Type Claim Identifier Claim Injury Type County of Injury COVID-19 Indicator District Name First Hearing Date Gender IME-4 Count Industry Code Industry Code Description Medical Fee Region OIICS Nature of Injury Description WCIO Cause of Injury Code WCIO Cause of Injury Description WCIO Nature of Injury Code WCIO Nature of Injury Description WCIO Part Of Body Code WCIO Part Of Body Description Zip Code Agreement Reached WCB Decision Number of Dependents
128 2019-12-19 42 N 2020-01-01 Y 1093.74 1977 2019-12-31 2020-01-14 PUBLIC EMPLOYERS RISK MGMT. 3A. SELF PUBLIC 5393974 4. TEMPORARY DUTCHESS N ALBANY 2020-08-27 M 6 92.0 PUBLIC ADMINISTRATION II NaN 99 OTHER - MISCELLANEOUS, NOC 52 STRAIN OR TEAR -9 MULTIPLE 12578 1 Not Work Related 6
235 2019-09-13 36 N 2020-01-02 Y 1500.00 1983 2020-02-07 2019-12-27 POLICE, FIRE, SANITATION 3A. SELF PUBLIC 5393785 5. PPD SCH LOSS SUFFOLK N HAUPPAUGE 2020-03-12 F 3 92.0 PUBLIC ADMINISTRATION IV NaN 81 STRUCK OR INJURED, NOC 10 CONTUSION -9 MULTIPLE 11704 0 Not Work Related 5
790 2019-12-12 57 N 2020-01-03 Y 764.35 1962 2020-01-03 2020-01-09 PROPERTY AND CASUALTY 1A. PRIVATE 5395679 5. PPD SCH LOSS BRONX N NYC 2020-02-18 M <NA> 31.0 MANUFACTURING IV NaN 29 ON SAME LEVEL 52 STRAIN OR TEAR -9 MULTIPLE 10466 0 Not Work Related 4
1006 2018-01-06 50 N 2020-01-03 N 0.00 1967 2020-01-03 NaT NEW HAMPSHIRE INSURANCE CO 1A. PRIVATE 5395006 2. NON-COMP BRONX N NYC NaT M <NA> 23.0 CONSTRUCTION UK NaN 97 REPETITIVE MOTION 59 ALL OTHER SPECIFIC INJURIES, NOC -9 MULTIPLE 07726 0 Not Work Related 0
2276 2019-12-26 33 N 2020-01-06 Y 670.00 <NA> 2020-01-06 2020-12-08 SYRACUSE, CITY OF 3A. SELF PUBLIC 5396338 4. TEMPORARY ONONDAGA N SYRACUSE 2021-02-16 M 9 92.0 PUBLIC ADMINISTRATION II NaN 56 LIFTING 52 STRAIN OR TEAR -9 MULTIPLE 13208 0 Not Work Related 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
593433 2022-11-23 30 N 2022-12-30 Y 0.00 1992 2023-01-06 2022-12-21 WESCO INSURANCE COMPANY 1A. PRIVATE 6164364 2. NON-COMP QUEENS N NYC NaT F <NA> 44.0 RETAIL TRADE IV NaN 26 FROM LADDER OR SCAFFOLDING 52 STRAIN OR TEAR -9 MULTIPLE NaN 0 Not Work Related 5
593438 2022-05-03 52 N 2022-12-30 N 1555.03 1970 2022-12-29 NaT ONONDAGA COUNTY SELF INS DIV 3A. SELF PUBLIC 6165182 4. TEMPORARY ONONDAGA N SYRACUSE NaT F 2 92.0 PUBLIC ADMINISTRATION II NaN 97 REPETITIVE MOTION 78 CARPAL TUNNEL SYNDROME -9 MULTIPLE 13215 0 Not Work Related 3
593439 2022-12-26 30 N 2022-12-30 Y 1234.19 1992 2022-12-30 NaT TONAWANDA, TOWN OF 3A. SELF PUBLIC 6165293 4. TEMPORARY ERIE N BUFFALO 2024-05-21 M 1 92.0 PUBLIC ADMINISTRATION I NaN 79 OBJECT BEING LIFTED OR HANDLED 90 MULTIPLE PHYSICAL INJURIES ONLY -9 MULTIPLE 14150 0 Not Work Related 3
593442 2022-12-22 53 N 2022-12-30 Y 0.00 1969 2022-12-30 NaT INDEMNITY INS. OF N AMERICA 1A. PRIVATE 6165175 2. NON-COMP NASSAU N NYC NaT F 1 72.0 ACCOMMODATION AND FOOD SERVICES IV NaN 29 ON SAME LEVEL 52 STRAIN OR TEAR -9 MULTIPLE 11570 0 Not Work Related 1
593467 2022-12-13 72 N 2022-12-31 N 0.00 1950 2022-12-31 NaT TECHNOLOGY INSURANCE CO. INC. 1A. PRIVATE 6165075 2. NON-COMP SULLIVAN N BINGHAMTON NaT F <NA> 48.0 TRANSPORTATION AND WAREHOUSING I NaN 25 FROM DIFFERENT LEVEL (ELEVATION) 90 MULTIPLE PHYSICAL INJURIES ONLY -9 MULTIPLE 12779 0 Not Work Related 3

42011 rows × 33 columns

Columns that should have matching number of unique values:¶

WCIO Part Of Body Code<> WCIO Part Of Body Description
Industry Code<> Industry Code Description
WCIO Cause of Injury Code<> WCIO Cause of Injury Description

Lets analyse what happened

WCIO Part Of Body Code<> WCIO Part Of Body Description

In [14]:
train_data[['WCIO Part Of Body Code', 'WCIO Part Of Body Description']].nunique()
#there are 3 more codes than descriptions
Out[14]:
WCIO Part Of Body Code           57
WCIO Part Of Body Description    54
dtype: int64
In [15]:
# Remove duplicatas com base no código e mantém apenas a primeira ocorrência de cada código
body_uniques = train_data.drop_duplicates(subset=['WCIO Part Of Body Code'])[['WCIO Part Of Body Code', 'WCIO Part Of Body Description']]

# Conta a frequência de cada descrição na lista sem duplicatas de código
body_repeats = body_uniques['WCIO Part Of Body Description'].value_counts()

# Filtra para mostrar apenas as descrições que se repetem
body_repeats[body_repeats > 1]
Out[15]:
WCIO Part Of Body Description
DISC           2
SOFT TISSUE    2
SPINAL CORD    2
Name: count, dtype: int64
In [16]:
body_uniques[body_uniques['WCIO Part Of Body Description'].isin(['DISC', 'SOFT TISSUE', 'SPINAL CORD'])][['WCIO Part Of Body Code', 'WCIO Part Of Body Description']]
#23: spinal cord
#18: soft tissue
#43: disc
Out[16]:
WCIO Part Of Body Code WCIO Part Of Body Description
29 23 SPINAL CORD
41 18 SOFT TISSUE
45 25 SOFT TISSUE
113 43 DISC
151 22 DISC
2761 47 SPINAL CORD

Industry Code<> Industry Code Description

In [17]:
train_data[['Industry Code', 'Industry Code Description']].nunique()
Out[17]:
Industry Code                24
Industry Code Description    20
dtype: int64
In [18]:
# Remove duplicatas com base no código e mantém apenas a primeira ocorrência de cada código
industry_uniques = train_data.drop_duplicates(subset=['Industry Code'])[['Industry Code', 'Industry Code Description']]

# Conta a frequência de cada descrição na lista sem duplicatas de código
industry_repeats = industry_uniques['Industry Code Description'].value_counts()

# Filtra para mostrar apenas as descrições que se repetem 
industry_repeats[industry_repeats > 1]
Out[18]:
Industry Code Description
MANUFACTURING                     3
RETAIL TRADE                      2
TRANSPORTATION AND WAREHOUSING    2
Name: count, dtype: int64
In [19]:
industry_uniques[industry_uniques['Industry Code Description'].isin(['MANUFACTURING', 'RETAIL TRADE', 'TRANSPORTATION AND WAREHOUSING'])][['Industry Code', 'Industry Code Description']]
#44: Retail Trade
#31 Manufacturing
#48 TRANSPORTATION AND WAREHOUSING
Out[19]:
Industry Code Industry Code Description
0 44.0 RETAIL TRADE
16 31.0 MANUFACTURING
17 48.0 TRANSPORTATION AND WAREHOUSING
18 45.0 RETAIL TRADE
31 49.0 TRANSPORTATION AND WAREHOUSING
65 33.0 MANUFACTURING
149 32.0 MANUFACTURING

WCIO Cause of Injury Code<> WCIO Cause of Injury Description

In [20]:
train_data[['WCIO Cause of Injury Code','WCIO Cause of Injury Description']].nunique()
Out[20]:
WCIO Cause of Injury Code           77
WCIO Cause of Injury Description    74
dtype: int64
In [21]:
# Remove duplicatas com base no código e mantém apenas a primeira ocorrência de cada código
cinjury_uniques = train_data.drop_duplicates(subset=['WCIO Cause of Injury Code'])[['WCIO Cause of Injury Code','WCIO Cause of Injury Description']]

# Conta a frequência de cada descrição na lista sem duplicatas de código
cinjury_repeats = cinjury_uniques['WCIO Cause of Injury Description'].value_counts()

# Filtra para mostrar apenas as descrições que se repetem
cinjury_repeats[cinjury_repeats > 1]
Out[21]:
WCIO Cause of Injury Description
OBJECT BEING LIFTED OR HANDLED    3
REPETITIVE MOTION                 2
Name: count, dtype: int64
In [22]:
cinjury_uniques[cinjury_uniques['WCIO Cause of Injury Description'].isin(['OBJECT BEING LIFTED OR HANDLED', 'REPETITIVE MOTION'])][['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]
#97: Repetitive Motion (replace 94)
#79: OBJECT BEING LIFTED OR HANDLED (replace 17,66)
Out[22]:
WCIO Cause of Injury Code WCIO Cause of Injury Description
1 97 REPETITIVE MOTION
2 79 OBJECT BEING LIFTED OR HANDLED
90 17 OBJECT BEING LIFTED OR HANDLED
141 66 OBJECT BEING LIFTED OR HANDLED
2953 94 REPETITIVE MOTION

Columns with both numbers and string as values¶

Carrier Type

In [23]:
train_data['Carrier Type'].unique()
#1A. PRIVATE
#2A. SIF, -> STATE INSURANCE FUND (acc to google:stands for Subsequent Injury Fund??)
#3A. SELF PUBLIC
#4A. SELF PRIVATE
#5A. SPECIAL FUND - CONS. COMM. (SECT. 25-A)
#5C. SPECIAL FUND - POI CARRIER train_data MENANDS
#5D. SPECIAL FUND - UNKNOWN
#NAN
#UNKNOWN
Out[23]:
array(['1A. PRIVATE', nan, '2A. SIF', '4A. SELF PRIVATE',
       '3A. SELF PUBLIC', 'UNKNOWN', '5D. SPECIAL FUND - UNKNOWN',
       '5A. SPECIAL FUND - CONS. COMM. (SECT. 25-A)',
       '5C. SPECIAL FUND - POI CARRIER WCB MENANDS'], dtype=object)

Claim Injury Type

In [24]:
train_data['Claim Injury Type'].unique()
#1. CANCELLED
#2. NON-COMP
#3. MED ONLY
#4. TEMPORARY
#5. PPD SCH LOSS
#6. PPD NSL
#7. PTD
#8. DEATH
Out[24]:
array(['2. NON-COMP', '4. TEMPORARY', nan, '3. MED ONLY',
       '5. PPD SCH LOSS', '6. PPD NSL', '1. CANCELLED', '8. DEATH',
       '7. PTD'], dtype=object)

1.5 Exploring Data Visually ¶

Numerical Data Visualization¶

For numerical data, we plotted an histogram showing the data distribution:

this visualization is not ideal for Average Weekly Wage.

In [25]:
# Select numerical attributes with continuous values
numeric_attributes = train_data.loc[:, ['Age at Injury','Average Weekly Wage', 'Birth Year','IME-4 Count','Number of Dependents']]

# Set the number of subplots
num_attributes = len(numeric_attributes.columns)
num_cols = 3  # Set the number of columns for the subplots
num_rows = (num_attributes // num_cols) + (num_attributes % num_cols > 0)  # Calculate rows needed

# Create subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 4))
axes = axes.flatten()  # Flatten the 2D array of axes

# Loop through each numerical attribute and plot its distribution
for i, column in enumerate(numeric_attributes.columns):
    sns.histplot(numeric_attributes[column], bins=30, kde=True, ax=axes[i])  # KDE adds a density curve
    axes[i].set_title(f'Distribution of {column}')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()
No description has been provided for this image

We can conclude that:¶

  • only 'Age at Injury' follows a normal distribution
In [26]:
categorical_columns = train_data.loc[:, ['Alternative Dispute Resolution', 'Attorney/Representative', 'Carrier Type', 'Claim Injury Type', 'County of Injury',
       'COVID-19 Indicator', 'District Name', 'Gender',
       'Industry Code Description', 'Medical Fee Region',
       'OIICS Nature of Injury Description',
       'WCIO Cause of Injury Description', 'WCIO Nature of Injury Description',
       'WCIO Part Of Body Description', 'Agreement Reached',
       'WCB Decision']]

# Filter only categorical columns with non-empty data
valid_categorical_columns = [
    col for col in categorical_columns 
    if train_data[col].notna().any()  # Check if the column has at least one non-NaN value
]

# Plot value counts for each valid categorical column
for col in valid_categorical_columns:
    plt.figure(figsize=(16, 8))
    value_counts = train_data[col].value_counts()

    if not value_counts.empty:  # Check if there are values to plot
        value_counts.plot(kind="bar")
        plt.title(f'Value Counts of {col}')
        plt.xlabel(col)
        plt.ylabel('Count')
        plt.show()
    else:
        print(f"Skipping column '{col}' as it contains no valid data.")
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [27]:
# Define age categories
bins = [0, 15, 18, 31, 42, 54, 75, 117]
labels = ['0-15', '16-18', '19-31', '32-42', '43-54', '55-71', '+75']

# Create a new column for age categories
train_data['Age Category'] = pd.cut(train_data['Age at Injury'], bins=bins, labels=labels, right=False)
age_distribution = train_data['Age Category'].value_counts()

# Plot the pie chart
plt.figure(figsize=(14, 10))  # Set the figure size

age_distribution.plot(kind='pie', autopct='%1.1f%%', colors=['skyblue', 'lightgreen', 'coral', 'orange', 'purple', 'yellow', 'pink'])

# Customize the plot
plt.title('Age at Injury')
plt.ylabel('')  # Remove the default y-label (which shows "Age Category")

# Show the plot
plt.show()
No description has been provided for this image
In [28]:
corr = train_data[numeric_attributes].corr(method="pearson").round(2)
mask_annot = np.absolute(corr.values) >= 0.45
annot = np.where(mask_annot, corr.values, np.full(corr.shape,""))
fig = plt.figure(figsize=(20, 15))

sns.heatmap(data=corr, 
            annot=annot,
            fmt='s',
            vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5,
            cmap='PiYG')
plt.show()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\internals\blocks.py:2328, in ExtensionBlock.fillna(self, value, limit, inplace, downcast, using_cow, already_warned)
   2327 try:
-> 2328     new_values = self.values.fillna(
   2329         value=value, method=None, limit=limit, copy=copy
   2330     )
   2331 except TypeError:
   2332     # 3rd party EA that has not implemented copy keyword yet

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\arrays\masked.py:266, in BaseMaskedArray.fillna(self, value, method, limit, copy)
    265             new_values = self[:]
--> 266         new_values[mask] = value
    267 else:

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\arrays\masked.py:314, in BaseMaskedArray.__setitem__(self, key, value)
    313 else:
--> 314     value = self._validate_setitem_value(value)
    315     self._data[key] = value

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\arrays\masked.py:305, in BaseMaskedArray._validate_setitem_value(self, value)
    301     # TODO: unsigned checks
    302 
    303 # Note: without the "str" here, the f-string rendering raises in
    304 #  py38 builds.
--> 305 raise TypeError(f"Invalid value '{str(value)}' for dtype {self.dtype}")

TypeError: Invalid value 'False' for dtype Int64

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[28], line 1
----> 1 corr = train_data[numeric_attributes].corr(method="pearson").round(2)
      2 mask_annot = np.absolute(corr.values) >= 0.45
      3 annot = np.where(mask_annot, corr.values, np.full(corr.shape,""))

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\frame.py:4089, in DataFrame.__getitem__(self, key)
   4087 # Do we have a (boolean) DataFrame?
   4088 if isinstance(key, DataFrame):
-> 4089     return self.where(key)
   4091 # Do we have a (boolean) 1d indexer?
   4092 if com.is_bool_indexer(key):

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\generic.py:10984, in NDFrame.where(self, cond, other, inplace, axis, level)
  10977             warnings.warn(
  10978                 _chained_assignment_warning_method_msg,
  10979                 FutureWarning,
  10980                 stacklevel=2,
  10981             )
  10983 other = common.apply_if_callable(other, self)
> 10984 return self._where(cond, other, inplace, axis, level)

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\generic.py:10660, in NDFrame._where(self, cond, other, inplace, axis, level, warn)
  10654 with warnings.catch_warnings():
  10655     warnings.filterwarnings(
  10656         "ignore",
  10657         "Downcasting object dtype arrays",
  10658         category=FutureWarning,
  10659     )
> 10660     cond = cond.fillna(fill_value)
  10661 cond = cond.infer_objects(copy=False)
  10663 msg = "Boolean array expected for the condition, not {dtype}"

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\generic.py:7434, in NDFrame.fillna(self, value, method, axis, inplace, limit, downcast)
   7432         new_data = result._mgr
   7433     else:
-> 7434         new_data = self._mgr.fillna(
   7435             value=value, limit=limit, inplace=inplace, downcast=downcast
   7436         )
   7437 elif isinstance(value, ABCDataFrame) and self.ndim == 2:
   7438     new_data = self.where(self.notna(), value)._mgr

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\internals\base.py:186, in DataManager.fillna(self, value, limit, inplace, downcast)
    182 if limit is not None:
    183     # Do this validation even if we go through one of the no-op paths
    184     limit = libalgos.validate_limit(None, limit=limit)
--> 186 return self.apply_with_block(
    187     "fillna",
    188     value=value,
    189     limit=limit,
    190     inplace=inplace,
    191     downcast=downcast,
    192     using_cow=using_copy_on_write(),
    193     already_warned=_AlreadyWarned(),
    194 )

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\internals\managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    361         applied = b.apply(f, **kwargs)
    362     else:
--> 363         applied = getattr(b, f)(**kwargs)
    364     result_blocks = extend_blocks(applied, result_blocks)
    366 out = type(self).from_blocks(result_blocks, self.axes)

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\internals\blocks.py:2334, in ExtensionBlock.fillna(self, value, limit, inplace, downcast, using_cow, already_warned)
   2331 except TypeError:
   2332     # 3rd party EA that has not implemented copy keyword yet
   2333     refs = None
-> 2334     new_values = self.values.fillna(value=value, method=None, limit=limit)
   2335     # issue the warning *after* retrying, in case the TypeError
   2336     #  was caused by an invalid fill_value
   2337     warnings.warn(
   2338         # GH#53278
   2339         "ExtensionArray.fillna added a 'copy' keyword in pandas "
   (...)
   2345         stacklevel=find_stack_level(),
   2346     )

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\arrays\masked.py:266, in BaseMaskedArray.fillna(self, value, method, limit, copy)
    264         else:
    265             new_values = self[:]
--> 266         new_values[mask] = value
    267 else:
    268     if copy:

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\arrays\masked.py:314, in BaseMaskedArray.__setitem__(self, key, value)
    312     self._mask[key] = True
    313 else:
--> 314     value = self._validate_setitem_value(value)
    315     self._data[key] = value
    316     self._mask[key] = False

File c:\Users\mainj\anaconda3\Lib\site-packages\pandas\core\arrays\masked.py:305, in BaseMaskedArray._validate_setitem_value(self, value)
    300         return value
    301     # TODO: unsigned checks
    302 
    303 # Note: without the "str" here, the f-string rendering raises in
    304 #  py38 builds.
--> 305 raise TypeError(f"Invalid value '{str(value)}' for dtype {self.dtype}")

TypeError: Invalid value 'False' for dtype Int64
In [ ]:
box_plot_features = ['Age at Injury', 'Average Weekly Wage', 'Claim Injury Type', 'First Hearing Date', 'Accident Date', 'Assembly Date']


sns.set()
fig, axes = plt.subplots(2, ceil(len(box_plot_features) / 2), figsize=(40, 11))
for ax, feat in zip(axes.flatten(), box_plot_features):
    sns.boxplot(x=train_data[feat], ax=ax)
    
title = "Numeric Variables' Box Plots"
plt.suptitle(title)
plt.show()
No description has been provided for this image
In [ ]:
# Configuração do tamanho e estilo do gráfico
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")

# Gráfico de boxplot para visualizar a idade em cada tipo de lesão
sns.boxplot(x='Claim Injury Type', y='Age at Injury', data=train_data)

# Ajuste dos rótulos e título
plt.xlabel('Claim Injury Type')
plt.ylabel('Age at Injury')
plt.title('Distribution of age by Claim Injury Type')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhorar a legibilidade

# Exibe o gráfico
plt.show()
No description has been provided for this image
In [ ]:
# Configuração do tamanho e estilo
sns.set(style="whitegrid")

# Scatter Plot: Average Weekly Wage x Claim Injury Type
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Average Weekly Wage', y='Claim Injury Type', data=train_data, alpha=0.5)
plt.xlabel('Average Weekly Wage')
plt.ylabel('Claim Injury Type')
plt.title('Scatter Plot: Claim Injury Type vs Average Weekly Wage')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhor legibilidade
plt.show()

# Box Plot: Average Weekly Wage x Claim Injury Type
plt.figure(figsize=(12, 6))
sns.boxplot(x='Average Weekly Wage', y='Claim Injury Type', data=train_data)
plt.xlabel('Average Weekly Wage')
plt.ylabel('Claim Injury Type')
plt.title('Box Plot: Average Weekly Wage vs Claim Injury Type')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhor legibilidade
plt.show()
No description has been provided for this image
No description has been provided for this image
In [ ]:
# Configuração do tamanho e estilo do gráfico
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")

# Gráfico de barras para contar o número de ocorrências de cada tipo de lesão por gênero
sns.countplot(x='Claim Injury Type', hue='Gender', data=train_data)

# Ajuste dos rótulos e título
plt.xlabel('Claim Injury Type')
plt.ylabel('Count')
plt.title('Distribution of Claim Injury Type by Gender')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhor legibilidade

# Exibe o gráfico
plt.show()


# Calcular as proporções de Claim Injury Type por Gender
proporcoes = train_data.groupby(['Claim Injury Type', 'Gender']).size().reset_index(name='Count')
proporcoes['Proporção'] = proporcoes.groupby('Claim Injury Type')['Count'].transform(lambda x: x / x.sum())

# Configuração do tamanho e estilo do gráfico
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")

# Gráfico de barras para visualizar a proporção
sns.barplot(x='Claim Injury Type', y='Proporção', hue='Gender', data=proporcoes)

# Ajuste dos rótulos e título
plt.xlabel('Claim Injury Type')
plt.ylabel('Proporcion')
plt.title('Proporcion of Claim Injury Type by Gender')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhor legibilidade

# Exibe o gráfico
plt.show()
No description has been provided for this image
No description has been provided for this image
In [ ]:
cat1 = 'Claim Injury Type'
cat2 = 'Gender'

fig, axes = plt.subplots(1,2, figsize=(12,4))

catpc_df = train_data.groupby([cat1, cat2])[cat2].size().unstack()
catpc_df.plot.bar(stacked=True, ax=axes[0])
axes[0].set_title('Claim Injury Type vs Gender, Absolute counts')
axes[0].legend([], frameon=False)

catpc_df2 = train_data.groupby([cat1, cat2])[cat2].size() / train_data.groupby([cat1])[cat2].size() 
catpc_df2.unstack().plot.bar(stacked=True, ax=axes[1])
axes[1].set_title('Claim Injury Type vs Gender, Relative counts')
axes[1].legend(loc=(1.02,0))

plt.show()
No description has been provided for this image
In [ ]:
# Contar as ocorrências de Claim Injury Type por Industry Code
heatmap_data = train_data.groupby(['Industry Code Description', 'Claim Injury Type']).size().unstack(fill_value=0)

# Configuração do tamanho e estilo do gráfico
plt.figure(figsize=(12, 8))
sns.set(style="whitegrid")

# Criação do heatmap
sns.heatmap(heatmap_data, annot=True, fmt='g', cmap='Blues', cbar_kws={'label': 'Count'})

# Ajuste dos rótulos e título
plt.xlabel('Claim Injury Type')
plt.ylabel('Industry Code Description')
plt.title('Heatmap: Claim Injury Type by Industry Code Description')

# Exibe o gráfico
plt.show()
No description has been provided for this image
In [ ]:
# Contar o número de ocorrências de Claim Injury Type por Medical Fee Region
contagem = train_data.groupby(['Medical Fee Region', 'Claim Injury Type']).size().reset_index(name='Count')

# Calcular a soma total de ocorrências por Medical Fee Region
soma_por_regiao = contagem.groupby('Medical Fee Region')['Count'].sum().reset_index()

# Ordenar as regiões pela contagem total
soma_por_regiao = soma_por_regiao.sort_values(by='Count', ascending=False)

# Mapeia as Medical Fee Regions ordenadas
contagem['Medical Fee Region'] = pd.Categorical(contagem['Medical Fee Region'], categories=soma_por_regiao['Medical Fee Region'], ordered=True)

# Calcular a soma total de ocorrências por Claim Injury Type
soma_por_tipo = contagem.groupby('Claim Injury Type')['Count'].sum().reset_index()

# Ordenar os tipos de lesão pela contagem total
soma_por_tipo = soma_por_tipo.sort_values(by='Count', ascending=False)

# Mapeia os Claim Injury Types ordenados
contagem['Claim Injury Type'] = pd.Categorical(contagem['Claim Injury Type'], categories=soma_por_tipo['Claim Injury Type'], ordered=True)

# Configuração do tamanho e estilo do gráfico
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")

# Gráfico de barras empilhadas
sns.barplot(x='Medical Fee Region', y='Count', hue='Claim Injury Type', data=contagem)

# Ajuste dos rótulos e título
plt.xlabel('Medical Fee Region')
plt.ylabel('Count')
plt.title('Distribution of Claim Injury Type by Medical Fee Region (Sorted)')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhorar a legibilidade

# Exibe o gráfico
plt.legend(title='Claim Injury Type')
plt.show()
No description has been provided for this image
In [ ]:
cat1 = 'Claim Injury Type'
cat2 = 'Medical Fee Region'

fig, axes = plt.subplots(1,2, figsize=(12,4))

catpc_df = train_data.groupby([cat1, cat2])[cat2].size().unstack()
catpc_df.plot.bar(stacked=True, ax=axes[0])
axes[0].set_title('Claim Injury Type vs Medical Fee Region, Absolute counts')
axes[0].legend([], frameon=False)

catpc_df2 = train_data.groupby([cat1, cat2])[cat2].size() / train_data.groupby([cat1])[cat2].size() 
catpc_df2.unstack().plot.bar(stacked=True, ax=axes[1])
axes[1].set_title('Claim Injury Type vs Medical Fee Region, Relative counts')
axes[1].legend(loc=(1.02,0))

plt.show()
No description has been provided for this image
In [ ]:
# Configuração do tamanho e estilo do gráfico
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")

# Contar o número de ocorrências de Claim Injury Type por WCIO Cause of Injury Description
contagem = train_data.groupby(['WCIO Cause of Injury Description', 'Claim Injury Type']).size().reset_index(name='Count')

# Calcular a soma total de ocorrências por WCIO Cause of Injury Description
soma_por_causa = contagem.groupby('WCIO Cause of Injury Description')['Count'].sum().reset_index()

# Selecionar as 10 descrições com mais valores
top10_causas = soma_por_causa.sort_values(by='Count', ascending=False).head(6)

# Filtrar o contagem para manter apenas as 10 descrições com mais valores
contagem_top10 = contagem[contagem['WCIO Cause of Injury Description'].isin(top10_causas['WCIO Cause of Injury Description'])]



# Gráfico de pontos
sns.stripplot(x='WCIO Cause of Injury Description', y='Count', hue='Claim Injury Type', data=contagem_top10, dodge=True, jitter=True)

# Ajuste dos rótulos e título
plt.xlabel('WCIO Cause of Injury Description (Top 10)')
plt.ylabel('Count')
plt.title('Scatter Plot: Claim Injury Type by WCIO Cause of Injury Description (Top 10)')
plt.xticks(rotation=90)  # Rotaciona os rótulos para melhorar a legibilidade

# Exibe o gráfico
plt.legend(title='Claim Injury Type')
plt.show()
No description has been provided for this image
In [ ]:
# Contar o número de ocorrências de Claim Injury Type por WCIO Cause of Injury Description
contagem = train_data.groupby(['WCIO Cause of Injury Description', 'Claim Injury Type']).size().reset_index(name='Count')

# Calcular a soma total de ocorrências por WCIO Cause of Injury Description
soma_por_causa = contagem.groupby('WCIO Cause of Injury Description')['Count'].sum().reset_index()

# Selecionar as 10 descrições com mais valores
top10_causas = soma_por_causa.sort_values(by='Count', ascending=False).head(10)

# Filtrar o contagem para manter apenas as 10 descrições com mais valores
contagem_top10 = contagem[contagem['WCIO Cause of Injury Description'].isin(top10_causas['WCIO Cause of Injury Description'])]

# Ordenar os dados pelo total de contagem de cada causa de lesão
contagem_top10 = contagem_top10.sort_values(by=['WCIO Cause of Injury Description', 'Claim Injury Type'])

# Configuração do tamanho e estilo do gráfico
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")

# Gráfico de barras empilhadas
sns.barplot(x='WCIO Cause of Injury Description', y='Count', hue='Claim Injury Type', data=contagem_top10, 
            order=top10_causas['WCIO Cause of Injury Description'].tolist())

# Ajuste dos rótulos e título
plt.xlabel('WCIO Cause of Injury Description (Top 10)')
plt.ylabel('Count')
plt.title('Distribution of Claim Injury Type by WCIO Cause of Injury Description (Top 10)')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhorar a legibilidade

# Exibe o gráfico
plt.legend(title='Claim Injury Type')
plt.show()
No description has been provided for this image
In [ ]:
# Contar o número de ocorrências de Claim Injury Type por WCIO Part of Body Description
contagem_part_body = train_data.groupby(['WCIO Part Of Body Description', 'Claim Injury Type']).size().unstack(fill_value=0)

# Criar o heat map
plt.figure(figsize=(12, 15))
sns.heatmap(contagem_part_body, annot=True, fmt='d', cmap='YlGnBu', cbar=True)

# Ajuste dos rótulos e título
plt.xlabel('Claim Injury Type')
plt.ylabel('WCIO Part of Body Description')
plt.title('Heat Map: WCIO Part of Body Description vs Claim Injury Type')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhorar a legibilidade
plt.yticks(rotation=0)   # Mantém os rótulos na horizontal

# Exibe o gráfico
plt.show()
No description has been provided for this image
In [ ]:
# Contar o número de ocorrências de Claim Injury Type por COVID-19 Indicator
contagem_covid = train_data.groupby(['COVID-19 Indicator', 'Claim Injury Type']).size().reset_index(name='Count')

# Configuração do tamanho e estilo do gráfico
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")

# Gráfico de barras
sns.barplot(x='COVID-19 Indicator', y='Count', hue='Claim Injury Type', data=contagem_covid)

# Ajuste dos rótulos e título
plt.xlabel('COVID-19 Indicator')
plt.ylabel('Count')
plt.title('Distribution of Claim Injury Type by COVID-19 Indicator')
plt.xticks(rotation=45)  # Rotaciona os rótulos para melhorar a legibilidade

# Exibe o gráfico
plt.legend(title='Claim Injury Type')
plt.show()
No description has been provided for this image
In [ ]:
cat1 = 'COVID-19 Indicator'
cat2 = 'Claim Injury Type'

fig, axes = plt.subplots(1,2, figsize=(12,4))

catpc_df = train_data.groupby([cat1, cat2])[cat2].size().unstack()
catpc_df.plot.bar(stacked=True, ax=axes[0])
axes[0].set_title('Claim Injury Type vs COVID-19 Indicator, Absolute counts')
axes[0].legend([], frameon=False)

catpc_df2 = train_data.groupby([cat1, cat2])[cat2].size() / train_data.groupby([cat1])[cat2].size() 
catpc_df2.unstack().plot.bar(stacked=True, ax=axes[1])
axes[1].set_title('Claim Injury Type vs COVID-19 Indicator, Relative Counts')
axes[1].legend(loc=(1.02,0))

plt.show()
No description has been provided for this image
In [ ]:
# Calculate the date differences
train_data['Hearing_to_Accident'] = (train_data['First Hearing Date'] - train_data['Accident Date']).dt.days
train_data['Assembly_to_Accident'] = (train_data['Assembly Date'] - train_data['Accident Date']).dt.days
train_data['Hearing_to_Assembly'] = (train_data['First Hearing Date'] - train_data['Assembly Date']).dt.days

# Set up the scatter plot
plt.figure(figsize=(12, 8))

# Scatter plot of 'Hearing_to_Accident' vs 'Assembly_to_Accident' colored by 'Claim Injury Type'
sns.scatterplot(
    data=train_data,
    x='Hearing_to_Accident',
    y='Assembly_to_Accident',
    hue='Claim Injury Type',
    palette='viridis',
    alpha=0.7
)

# Adding titles and labels
plt.title('Scatter Plot of Date Differences by Claim Injury Type')
plt.xlabel('Days from Accident to First Hearing')
plt.ylabel('Days from Accident to Assembly Date')
plt.legend(title='Claim Injury Type')
plt.grid(True)

plt.show()
No description has been provided for this image

Categorical Data Visualization¶

For categorical data, we plotted a bar chart showing frequency of unique values:

this visualization is not ideal for following attributes, that present more dense values
Zip Code
Carrier Name

[ ] change visual for these attributes

1.6 Multivariate Relationships ¶

Gender <> Day of Year

In [ ]:
train_data['Day of Year'] = train_data['Accident Date'].dt.dayofyear
In [ ]:
# Map Gender to numerical values (e.g., 0 for Male, 1 for Female)
train_data['Gender Numeric'] = train_data['Gender'].map({'M': 0, 'F': 1})

Useful Aggregations¶

In [ ]:
train_data[train_data['C-3 Date'].isna()]['Claim Injury Type'].value_counts()
Out[ ]:
Claim Injury Type
2    252951
4     69533
3     45065
5     11065
1      6867
6       859
8       423
7        18
Name: count, dtype: int64
In [ ]:
train_data[train_data['IME-4 Count'].isna()]['Claim Injury Type'].value_counts()
Out[ ]:
Claim Injury Type
2    285575
4     76840
3     55113
1     12115
5     10996
8       369
6       211
7         4
Name: count, dtype: int64
In [ ]:
train_data[train_data['C-2 Date'].isna()]['Claim Injury Type'].value_counts()
Out[ ]:
Claim Injury Type
2    7502
1    6710
4     218
3     108
5      21
7       1
Name: count, dtype: int64
In [ ]:
# Get value counts for 'Claim Injury Type' where 'C-2 Date' and 'C-3 Date' are NaN
c2_counts = train_data[train_data['C-2 Date'].isna()]['Claim Injury Type'].value_counts()
c3_counts = train_data[train_data['C-3 Date'].isna()]['Claim Injury Type'].value_counts()
c2_c3 = train_data[(train_data['C-3 Date'].isna()) & (train_data['C-2 Date'].isna())]['Claim Injury Type'].value_counts()

# Total counts for normalization (relative frequency)
total_c2 = c2_counts.sum()
total_c3 = c3_counts.sum()
totalc2_c3 = c2_c3.sum()

# Create a DataFrame combining absolute and relative frequencies
claim_injury_summary = pd.DataFrame({
    'without C-2 Form': c2_counts,
    'without C-3 Form': c3_counts,
    'Without both forms': c2_c3,
    'Both forms RELATIVE FREQ': (c2_c3 / totalc2_c3) * 100,
    'C-2 Relative Frequency (%)': (c2_counts / total_c2) * 100,
    'C-3 Relative Frequency (%)': (c3_counts / total_c3) * 100,
}).fillna(0)  # Fill NaN with 0 for categories missing in either column

# Display the DataFrame
claim_injury_summary
Out[ ]:
without C-2 Form without C-3 Form Without both forms Both forms RELATIVE FREQ C-2 Relative Frequency (%) C-3 Relative Frequency (%)
Claim Injury Type
1 6710.0 6867 2706.0 42.573946 46.085165 1.775423
2 7502.0 252951 3529.0 55.522341 51.524725 65.399024
3 108.0 45065 21.0 0.330396 0.741758 11.651296
4 218.0 69533 99.0 1.557583 1.497253 17.977357
5 21.0 11065 0.0 0.000000 0.144231 2.860792
6 0.0 859 0.0 0.000000 0.000000 0.222090
7 1.0 18 1.0 0.015733 0.006868 0.004654
8 0.0 423 0.0 0.000000 0.000000 0.109364

Multivariate Relationships ¶

In [ ]:
# Calculate mean days for each industry code
# Calculate mean days for each industry code without resetting the index
mean_days_by_industry = train_data.groupby('Industry Code')['Days from Accident to Assembly'].mean()

# Plot the bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=mean_days_by_industry.index, y=mean_days_by_industry.values, palette='viridis')

# Add labels and title
plt.xlabel('Industry Code')
plt.ylabel('Average Days from Accident to Assembly')
plt.title('Average Days from Accident to Assembly by Industry Code')
plt.xticks(rotation=45)
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[160], line 3
      1 # Calculate mean days for each industry code
      2 # Calculate mean days for each industry code without resetting the index
----> 3 mean_days_by_industry = X_train.groupby('Industry Code')['Days from Accident to Assembly'].mean()
      5 # Plot the bar plot
      6 plt.figure(figsize=(12, 6))

NameError: name 'X_train' is not defined
In [ ]:
# Calculate mean days for each industry code
# Calculate mean days for each industry code without resetting the index
mean_days_by_injury = train_data.groupby('WCIO Cause of Injury Code')['Days from Accident to Assembly'].mean()

# Plot the bar plot
plt.figure(figsize=(16, 6))
sns.barplot(x=mean_days_by_industry.index, y=mean_days_by_industry.values, palette='viridis')

# Add labels and title
plt.xlabel('Cause of Injury Code')
plt.ylabel('Average Days from Accident to Assembly')
plt.title('Average Days from Accident to Assembly by Cause of Injury Code')
plt.xticks(rotation=50)
plt.show()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[175], line 3
      1 # Calculate mean days for each industry code
      2 # Calculate mean days for each industry code without resetting the index
----> 3 mean_days_by_injury = train_data.groupby('WCIO Cause of Injury Code')['Days from Accident to Assembly'].mean()
      5 # Plot the bar plot
      6 plt.figure(figsize=(16, 6))

File /opt/anaconda3/lib/python3.12/site-packages/pandas/core/groupby/generic.py:1951, in DataFrameGroupBy.__getitem__(self, key)
   1944 if isinstance(key, tuple) and len(key) > 1:
   1945     # if len == 1, then it becomes a SeriesGroupBy and this is actually
   1946     # valid syntax, so don't raise
   1947     raise ValueError(
   1948         "Cannot subset columns with a tuple with more than one element. "
   1949         "Use a list instead."
   1950     )
-> 1951 return super().__getitem__(key)

File /opt/anaconda3/lib/python3.12/site-packages/pandas/core/base.py:244, in SelectionMixin.__getitem__(self, key)
    242 else:
    243     if key not in self.obj:
--> 244         raise KeyError(f"Column not found: {key}")
    245     ndim = self.obj[key].ndim
    246     return self._gotitem(key, ndim=ndim)

KeyError: 'Column not found: Days from Accident to Assembly'
In [ ]:
# Calculate mean days for each industry code
# Calculate mean days for each industry code without resetting the index
mean_days_by_injury = train_data.groupby('WCIO Nature of Injury Code')['Days from Accident to Assembly'].mean()

# Plot the bar plot
plt.figure(figsize=(16, 6))
sns.barplot(x=mean_days_by_industry.index, y=mean_days_by_industry.values, palette='viridis')

# Add labels and title
plt.xlabel('Nature of Injury Code')
plt.ylabel('Average Days from Accident to Assembly')
plt.title('Average Days from Accident to Assembly by Nature of Injury Code')
plt.xticks(rotation=50)
plt.show()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[179], line 3
      1 # Calculate mean days for each industry code
      2 # Calculate mean days for each industry code without resetting the index
----> 3 mean_days_by_injury = train_data.groupby('WCIO Nature of Injury Code')['Days from Accident to Assembly'].mean()
      5 # Plot the bar plot
      6 plt.figure(figsize=(16, 6))

File /opt/anaconda3/lib/python3.12/site-packages/pandas/core/groupby/generic.py:1951, in DataFrameGroupBy.__getitem__(self, key)
   1944 if isinstance(key, tuple) and len(key) > 1:
   1945     # if len == 1, then it becomes a SeriesGroupBy and this is actually
   1946     # valid syntax, so don't raise
   1947     raise ValueError(
   1948         "Cannot subset columns with a tuple with more than one element. "
   1949         "Use a list instead."
   1950     )
-> 1951 return super().__getitem__(key)

File /opt/anaconda3/lib/python3.12/site-packages/pandas/core/base.py:244, in SelectionMixin.__getitem__(self, key)
    242 else:
    243     if key not in self.obj:
--> 244         raise KeyError(f"Column not found: {key}")
    245     ndim = self.obj[key].ndim
    246     return self._gotitem(key, ndim=ndim)

KeyError: 'Column not found: Days from Accident to Assembly'

Conditional Mode - by Industry Code¶

In [ ]:
# Prepare the data for a stacked bar chart
cause_industry_counts = train_data.groupby(['WCIO Cause of Injury Code', 'Industry Code']).size().unstack(fill_value=0)

# Plot stacked bar chart
cause_industry_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Distribution of WCIO Cause of Injury Code by Industry Code', fontsize=14)
plt.xlabel('WCIO Cause of Injury Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Industry Code', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Prepare the data for a stacked bar chart
industry_cause_counts = train_data.groupby(['Industry Code', 'WCIO Cause of Injury Code']).size().unstack(fill_value=0)

# Plot stacked bar chart
industry_cause_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Distribution of Industry Code by WCIO Cause of Injury Code', fontsize=14)
plt.xlabel('Industry Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='WCIO Cause of Injury Code', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
/var/folders/2t/s2f8bw197wgg7dqr1kyxyqn80000gn/T/ipykernel_11503/950163664.py:11: UserWarning: Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all axes decorations.
  plt.tight_layout()
No description has been provided for this image
In [ ]:
# Prepare the data for a stacked bar chart
body_industry_counts = train_data.groupby(['WCIO Part Of Body Code', 'Industry Code']).size().unstack(fill_value=0)

# Plot stacked bar chart
body_industry_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='plasma')
plt.title('Distribution of WCIO Part Of Body Code by Industry Code', fontsize=14)
plt.xlabel('WCIO Part Of Body Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Industry Code', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Prepare the data for a stacked bar chart
industry_body_counts = train_data.groupby(['Industry Code', 'WCIO Part Of Body Code']).size().unstack(fill_value=0)

# Plot stacked bar chart
industry_body_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='plasma')
plt.title('Distribution of Industry Code by WCIO Part Of Body Code', fontsize=14)
plt.xlabel('Industry Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='WCIO Part Of Body Code', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
/var/folders/2t/s2f8bw197wgg7dqr1kyxyqn80000gn/T/ipykernel_11503/2326239920.py:11: UserWarning: Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all axes decorations.
  plt.tight_layout()
No description has been provided for this image
In [ ]: